Y. Xia, J. Sun Bioinformatic and Statistical Analysis of Microbiome Data https://doi.org/10.1007/978-3-031-21391-5_13

13. Zero-Inflated Beta Models for Microbiome Data

Yinglin Xia¹ and Jun Sun ¹

(1)

Department of Medicine, University of Illinois Chicago, Chicago, IL, USA

Abstract

This chapter introduces two specifically designed zero-inflated beta regression models for analyzing zero-inflated count microbiome data. First, it briefly introduces the zero-inflated beta modeling microbiome data. Then, it introduces the zero-inflated beta regression (ZIBSeq). Next, it introduces the zero-inflated beta-binomial model (ZIBB).

Keywords

Zero-inflated beta Zero-inflated beta regression (ZIBSeq) Zero-inflated beta-binomial model (ZIBB) Semi-continuous Zero-inflated continuous Zero-inflated beta regression model with random effects (ZIBR) Marginalized two-part beta regression (MTPBR) Likelihood ratio tests (LRT) ZIBBSeqDiscovery fitZIBB()

This chapter introduces two specifically designed zero-inflated beta regression models for analyzing zero-inflated count microbiome data. First we briefly introduce the zero-inflated beta modeling microbiome data (Sect. 13.1). Then we introduce the zero-inflated beta regression (ZIBSeq) and its implementation (Sect. 13.2). Next, we introduce the zero-inflated beta-binomial model (ZIBB) and its implementation (Sect. 13.3). Finally, we briefly summarize the zero-inflated beta regression in microbiome study (Sect. 13.4).

13.1 Zero-Inflated Beta Modeling Microbiome Data

One trend of statistical modeling microbiome data is to use two-part zero-inflated beta regression to detect differential abundance of microbes across clinical conditions or different treatments. The two-part zero-inflated beta regression is attractive because it takes the unique characteristics of microbiome data into account; that is, microbiome data are compositional (quantified by relative abundance), highly skewed, and bounded in [0, 1) and often have many zeros.

A “semi-continuous” or “zero-inflated continuous” data describes a kind of continuous data with a point mass at zero and a right-skewed continuous distribution with a positive values. The raw sequencing reads of microbiome data are absolute abundant counts. However, if the absolute abundant counts are transformed or normalized into relative abundance, then microbiome data can be characterized as a “semi-continuous” or “zero-inflated continuous” data; in which the zero values indicate that either the certain microbes are absent in the sample (structural zeros) or the rare microbes are present but missed due to under-sampling (sampling-zeros), while the continuous distribution of positive values describes the levels of relative abundance among the present microbes (Chai et al. 2018).

Typically, a two-part zero-inflated beta model consists of two sub-models: Part I uses a logistic regression to model the probability of a feature (taxon) having zero values, while Part II uses a beta regression to model the distribution of positive values, i.e., the relative abundance of feature (or taxon) conditionally presented.

So far several two-part beta regression models have been developed for modeling relative abundance of microbiome data, including zero-inflated beta regression (ZIBSeq) (Peng et al. 2016), zero-inflated beta-binomial model (ZIBB) (Hu et al. 2018), zero-inflated beta regression model with random effects (ZIBR) (Chen and Li 2016), and marginalized two-part beta regression (MTPBR) (Chai et al. 2018). Chai et al. (2018) thought that the regression coefficients in Part II in previous two-part zero-inflated beta models is not able to marginally (unconditionally) interpret the covariate effects on the microbial abundance, which is often the interest of microbiome researches. Thus, unlike BESeq and other two-part zero-inflated beta models, MTPBR was proposed to examine covariate effects on the overall marginal (unconditional) mean, while capturing the zero-inflation and skewness of microbiome data (Chai et al. 2018). MTPBR model was derived starting with the conventional two-part model with a beta component in Part II (Chen and Li 2016; Ospina and Ferrari 2012; Peng et al. 2016). Then, in order to obtain the impact of covariates on the overall marginal mean, MTPBR model reparameterizes the likelihood of the conventional two-part model and provides the different interpretation of covariate effects. The distinctive feature of MTPBR is its ability of modeling covariate effects on the marginal mean of the outcome. It was showed that in terms of controlling the type I error, and power as well as robustness, MTPBR has satisfactory performance and outperformed the conventional two-part (CTP) model using the likelihood ratio tests (LRT) for both simulated real data analyses (Chai et al. 2018).

In summary, all the proposed zero-inflated beta models use zero-inflated beta regression to fit relative abundant microbiome data; thus, they aim to take into account the compositional and zero-inflation nature of the microbiome data. However, all these models have their own advantages and limitations.

MTPBR is implemented via SAS NLMIXED procedure, and ZIBR is a longitudinal model that we have introduced in our previous book (2018) (Xia et al. 2018a); thus in this chapter we focus on introducing ZIBSeq and ZIBB.

13.2 Zero-Inflated Beta Regression (ZIBSeq)

ZIBSeq is a zero-inflated beta regression approach for differential abundance analysis of metagenomics data. It is an extension of the generalized linear model (GLM).

13.2.1 Introduction to ZIBseq

ZIBSeq (Peng et al. 2016) was developed to identify differentially abundant features between multiple clinical conditions while addressing the unique characteristics of metagenomics data with small sample size, high dimensionality, sparsity, often with a large number of zeros and skewed distribution under the compositions (proportions) setting. ZIBSeq directly handles the proportion data, assuming that proportional dependent variable can be characterized by the beta distribution.

13.2.1.1 ZIBseq Model

Assume x follows a beta distribution: x~Beta(μ, ϕ), then the density of beta distribution (Ferrari and Cribari-Neto 2004) can be described as a function of u and ϕ:

$f\left(\xi, \mu, \phi \right)=\frac{\Gamma \left(\phi \right)}{\Gamma \left(\mu \phi \right)\Gamma \left(\left(1-\mu \right)\phi \right)}{\xi}^{\mu \phi -1}{\left(1-t\right)}^{\left(1-\mu \right)\phi -1},0<\xi <1$

(13.1)

The mean and variance can be parameterized as:

$\left\{\begin{array}{l}E(x)=\mu \\ {}\mathrm{Var}(x)=\mu \left(1-\mu \right)/\left(\phi +1\right)\end{array}\right.,$

(13.2)

where Γ(⋅) is the gamma function, μ(0 < μ < 1) is the mean, and ϕ(ϕ > 0) is a precision parameter.

Beta distribution has a wide range of shapes depending on the values of mean and precision parameters. Thus, beta regression models are very suitable to model the continuous response variables of rates and proportions when their values are restricted to the interval (0,1). To model the variable containing zero or one, Ospina and Ferrari in 2012 (Ospina and Ferrari 2012) proposed a more general class of zero-or-one inflated beta regression models for continuous proportions. Given that microbiome data often contains zero instead of one, ZIBSeq was developed based on the zero-inflated beta regression, assuming the response variable has a mixed continuous-discrete distribution with probability mass at zero.

The zero-inflated beta distribution adds a new parameter α to account for the probability of observations at zero. The subsequent mixture density is given as below:

$Bi\left(x;\alpha, \mu, \phi \right)=\left\{\begin{array}{l}\alpha \\ {}\left(1-\alpha \right)f\left(x;\mu, \phi \right)\end{array}\right.\frac{\mathrm{if}\kern0em x=0}{\mathrm{if}\kern0em 0<x<1},$

(13.3)

where f(x; μ, ϕ) is the beta density in (13.1). The zero-inflated beta and hence ZIBSeq directly model the proportion of the response variables. Thus, the feature abundances are first normalized into the proportion of features. Let ${x}_i^{(j)}$ denote the normalized feature abundance, i.e., the proportion of feature j reads in sample i, ZIBSeq simply normalizes the feature abundance by dividing read counts of each given feature by the total mapped read counts in the sample: ${x}_i^{(j)}={c}_{ij}/{T}_i$ . Thus, the obtained ${x}_1^{(j)},{x}_2^{(j)},\dots, {x}_n^{(j)}$ are proportions of feature j, j = 1, …, m on n samples. By assuming that each ${x}_i^{(j)}$ has a probability density function (13.3) with parameters $\alpha ={\alpha}_i^{(j)}$ , $\mu ={\mu}_i^{(j)}$ , and $\phi ={\phi}_i^{(j)}$ . To fit the parameters in mixture distribution (13.3) for feature j, the ZIBSeq model is defined as below:

$\mathrm{logit}\left({\alpha}_i^{(j)}\right)={\rho}_0^{(j)},\mathrm{logit}\left({\mu}_i^{(j)}\right)={\beta}_0^{(j)}+{\beta}_1^{(j)}{y}_i,\mathrm{and}\ {\phi}_i^{(j)}={T}_1-1.$

(13.4)

In the above equations, ${\rho}_0^{(j)}$ , ${\beta}_0^{(j)}$ , and ${\beta}_1^{(j)}$ are unknown regression parameters to be estimated, while y_i is an outcome measurement indicating the class of sample i, while T_i is the ith sample depth. The dispersion parameter ϕ in (13.4) is approximated by a binomial distribution and a beta distribution under parameterization in (13.1). First, assume the number of reads on a particular feature c follows a binomial distribution with the sample depth T and the probability that the sample reads in this feature μ. Second, let $x=\frac{c}{T}-N\left(\mu, \frac{\mu \left(1-\mu \right)}{T}\right)$ , then, the variance of proportion x will be $\mathrm{Var}(x)=\frac{\mu \left(1-\mu \right)}{T}$ . Third, in the case, the proportion x follows a beta distribution under parameterization in (13.1), the variance will be $\mathrm{Var}(x)=\frac{\mu \left(1-\mu \right)}{\phi +1}$ . An approximation of the dispersion parameter ϕ: ϕ ≈ T − 1 is derived from these two forms of variance.

13.2.1.2 Statistical Hypothesis Testing of Microbiome Composition and Outcome

The statistical hypothesis testing of no significant difference in relative abundance of taxa between class membership or experimental conditions is given as: ${\beta}_1^{(j)}$ = 0. For large samples, the true null distributions can be approximated by a chi-squared distribution. ZIBSeq takes numerical approach to conduct the statistical hypothesis testing of null distributions to find the maximum likelihood estimates in (13.4). The t-statistic and corresponding P-value are obtained by implementing R package GAMLSS (Stasinopoulos and Rigby 2007).

13.2.2 Implement ZIBseq

The microbiome data structure that ZIBseq works on is a n × m matrix (see Table 11.1), with each of n rows presenting samples collected from two or more classes, and each of m columns presenting expression levels of genes or abundance of taxa measured for each sample. Each entry c_ij denotes the number of reads from sample i that mapped to feature (i.e., taxon) j. ${T}_i={\sum}_{j=1}^m{c}_{ij}$ denotes the total number of reads of sample i. Each count c_ij not only depends on the reads of taxon j but also varies from the sequencing depth: the total number of reads of sample i. In Table 13.1, y_i, i = 1, …, n denotes the outcome variable associated with sample i, and it could be class membership of the sample.

Table 13.1

Microbiome data structure for ZIBseq model

	Feature₁	Feature₂	…	Feature_m	Total	Outcome
Sample₁	c₁₁	c₁₂	…	c_1m	T₁	y₁
Sample₂	c₂₁	c₂₂	…	c_2m	T₂	y₂
…			…	…	…	…
Sample_n	c_n1	c_n2	…	c_nm	T_n	y_n

ZIBseq is implemented via the function ZIBseq(). The syntax of function ZIBseq() is given below:

ZIBseq(data, outcome, transform = F, alpha = 0.05).

where the argument:

data is a matrix recording count data, which could be extracted from a data frame with samples (or cases) as rows and taxa (or variables) as columns format recording the sample (or meta) and count data.
outcome is a categorical vector of a clinical condition or treatment.
transform is logical value indicating the square-root transform of the compositional matrix.
alpha is used for customized threshold for calculating Q-values.

After running this function, the significant feature (“sigFeature”), the features that have been concerned (“useFeature”), and both Q-value (“qvalue”) and P-value (“pvalue”) will be provided in the output.

Example 13.1: Vitamin D Receptor (VDR) Knockout Mice

Data is from our study (2015) “Lack of Vitamin D Receptor Causes Dysbiosis and Changes the Functions of the Murine Intestinal Microbiome” (Jin et al. 2015). This dataset has been used in our previous book Statistical Analysis of Microbiome Data with R (Xia et al. 2018b).

The “otu_table_fecal_genus_vdr.csv” is a csv file containing 16S rRNA sequencing data matrix for the study samples. The VDR mice abundance data provides the absolute sequence read counts for a total of 240 taxa at the genus rank level (rows) in 8 samples (columns), preprocessed as described in the main article. Among these eight samples, five samples with VDR were knockout and the remaining three samples were wild types. We are interested in whether dysbiosis and functions of intestinal microbiome are associated with the VDR knockout. Thus, the main metadata is the VDR condition status. We can provide additional metadata, which could be a CSV file containing the sample metadata in the data matrix to run the ZIBSeq. Here, we use commands to extract the grouping information from the sequencing data matrix.

First, set up the working directory and install the latest version of the ZIBseq package (latest version 1.2, June, 2017) in R or RStudio.

> setwd("~/Documents/QIIME2R/Ch13_ZeroInflatedBeta")

> install.packages("ZIBseq")

Then, we can perform ZIBSeq using the following four steps.

Step 1: Load the dataset.

> otu_tab=read.csv("otu_table_fecal_genus_vdr.csv", row.names=1,check.names=FALSE)

> head(otu_tab,3)

5_15_drySt-28F 1_11_drySt-28F 2_12_drySt-28F 3_13_drySt-28F

Tannerella 476 549 578 996

Lactococcus 326 2297 548 2378

Lactobacillus 94 434 719 322

4_14_drySt-28F 7_22_drySt-28F 8_23_drySt-28F 9_24_drySt-28F

Tannerella 404 319 526 424

Lactococcus 471 882 1973 2308

Lactobacillus 205 644 2340 1000

Step 2: Create grouping variable.

We can use the sample ids to create grouping variable.

> otu_tab_t<-t(otu_tab)

> head(otu_tab_t,3)

Tannerella Lactococcus Lactobacillus Lactobacillus::Lactococcus

5_15_drySt-28F 476 326 94 1

1_11_drySt-28F 549 2297 434 25

2_12_drySt-28F 578 548 719 5

> grouping<-data.frame(row.names=rownames(otu_tab_t),t(as.data.frame(strsplit(rownames(otu_tab_t),"_"))))

> grouping

X1 X2 X3

5_15_drySt-28F 5 15 drySt-28F

1_11_drySt-28F 1 11 drySt-28F

2_12_drySt-28F 2 12 drySt-28F

3_13_drySt-28F 3 13 drySt-28F

4_14_drySt-28F 4 14 drySt-28F

7_22_drySt-28F 7 22 drySt-28F

8_23_drySt-28F 8 23 drySt-28F

9_24_drySt-28F 9 24 drySt-28F

> grouping$Group <- with(grouping,ifelse(X2%in% c(11,12,13,14,15),c("Vdr-/-"), c("WT")))

> grouping

X1 X2 X3 Group

5_15_drySt-28F 5 15 drySt-28F Vdr-/-

1_11_drySt-28F 1 11 drySt-28F Vdr-/-

2_12_drySt-28F 2 12 drySt-28F Vdr-/-

3_13_drySt-28F 3 13 drySt-28F Vdr-/-

4_14_drySt-28F 4 14 drySt-28F Vdr-/-

7_22_drySt-28F 7 22 drySt-28F WT

8_23_drySt-28F 8 23 drySt-28F WT

9_24_drySt-28F 9 24 drySt-28F WT

> grouping

[1] "Vdr-/-" "Vdr-/-" "Vdr-/-" "Vdr-/-" "Vdr-/-" "WT" "WT" "WT"

Step 3: Merge grouping and abundance taxa datasets to create a data frame.

> df_G <-as.data.frame(cbind(grouping,otu_tab_t))

> head(df_G,3) grouping Tannerella Lactococcus Lactobacillus

5_15_drySt-28F Vdr-/- 476 326 94

1_11_drySt-28F Vdr-/- 549 2297 434

2_12_drySt-28F Vdr-/- 578 548 719

> class(df_G)

[1] "data.frame"

> row.names(df_G)

[1] "5_15_drySt-28F" "1_11_drySt-28F" "2_12_drySt-28F" "3_13_drySt-28F"

[5] "4_14_drySt-28F" "7_22_drySt-28F" "8_23_drySt-28F" "9_24_drySt-28F"

> ncol(df_G)

[1] 249

> dim(df_G)

[1] 8 249

> head(df_G,3)

The data frame has 249 columns, the first column is grouping variable, the remaining 248 columns are taxa.

Of course, we can create a meta dataset a priori and then load it into R and merge it to taxa abundance dataset.

Then, we can perform ZIBSeq analysis. The following steps including screening features, normalizing and transforming data as well as correcting multiple hypothesis testing are processed automatically by the ZIBSeq() function or through specifying such as to specify transform = TRUE or T will do a square root or cube root transformation. We explicitly list them below for reminding us what are really the function ZIBSeq() working on.

Step 4: Filter low abundant features.

Because taxa with a small number of reads are not reliable, so the function first remove any features with total counts less than 2 time large of the sample size to reduce the effects of noises and measurement errors.

Step 5: Normalize and transform data.

ZIBSeq uses a simple normalization procedure to convert the raw abundance measure to a proportion by dividing each feature read count by the total feature read counts in the sample, which results in relative abundance measure ranging [0,1). After normalization, a square root or cube root transformations is performed to ensure the proportion data are better fitting a beta distribution if the distributions of the proportion are extremely left skewed.

Step 6: Define testing features, construct design matrix and run ZIBseq.

Perform zero-inflated beta regression (Ospina and Ferrari 2012) to predict each normalized feature (response variable) with outcome (explanatory variable); the P-value of the regression coefficient in each regression is obtained.

> library(gamlss)

> library(ZIBseq)

> yy=df_G[,2:249] # specify the 248 taxa

> for (i in 1:ncol(yy)){yy[,i]=as.numeric(as.character(yy[,i]))}

> grp=df_G[,1] # specify the group variable

> grp

[1] "Vdr-/-" "Vdr-/-" "Vdr-/-" "Vdr-/-" "Vdr-/-" "WT" "WT" "WT"

> grp1 <- gsub("Vdr-/-", "1",grp)

> grp2 <- gsub("WT", "0", grp1)

> grp3=as.numeric(grp2)

> grp3

[1] 1 1 1 1 1 0 0 0

> result=ZIBseq(data=yy,outcome=grp3,transform = T, alpha = 0.05)

We can use the summary() function to retrieve the summary information.

> summary(result)

Length Class Mode

sigFeature 2 -none- character

useFeature 1 -none- numeric

qvalues 38 -none- numeric

pvalues 38 -none- numeric

We can also obtain the names of significant features and how many numbers of features have been used in this analysis.

> result$sigFeature

[1] "Lactobacillus" "Acinetobacter"

> result$useFeature

[1] 38

Step 7: Correct multiple hypothesis testing.

We can retrieve the P-values.

> result$pvalues

[1] 1.447e-01 1.982e-01 5.668e-03 1.425e-01 6.503e-01 3.321e-01 1.108e-01

[8] 1.843e-02 4.919e-01 1.495e-01 1.081e-01 7.490e-02 4.471e-01 2.722e-02

[15] 6.060e-01 1.080e-01 5.808e-02 1.603e-01 9.405e-01 5.989e-01 2.484e-01

[22] 3.193e-01 5.835e-01 1.411e-05 7.196e-01 7.299e-02 7.859e-01 2.423e-01

[29] 7.379e-01 1.334e-01 1.693e-01 3.883e-01 9.207e-01 3.591e-01 2.404e-01

[36] 5.300e-01 8.546e-02 6.135e-01

Use the FDR algorithm proposed by Storey and Tibshirani (2003) to estimate a conservative Q-value based on P-values obtained under the assumption that P-values are uniformly distributed.

> result$qvalues

[1] 0.1258969 0.1392339 0.0358341 0.1258969 0.2491919 0.1825697 0.1258969

[8] 0.0776776 0.2303659 0.1258969 0.1258969 0.1258969 0.2174525 0.0860320

[15] 0.2424364 0.1258969 0.1258969 0.1258969 0.3129562 0.2424364 0.1495700

[22] 0.1825697 0.2424364 0.0001784 0.2665825 0.1258969 0.2760442 0.1495700

[29] 0.2665825 0.1258969 0.1258969 0.1964160 0.3129562 0.1891691 0.1495700

[36] 0.2393545 0.1258969 0.2424364

Step 8: Write the results of P-values and Q-values as tables (Table 13.2).

> # Write results table

> # Make the table

> pvalues <- result$pvalues

> qvalues <- result$qvalues

> pq<- data.frame(cbind(pvalues,qvalues))

> library(xtable)

> table <- xtable(pq,caption = "Table of significant taxa",lable="sig_taxa_table")

> print.xtable(table,type="html",file = "ZIBSeq_Table_VDR.html")

> write.csv(pq,file = paste("Results_ZIBSeq_Table_VDR.csv",sep = ""))

Table 13.2

P-values and Q-values from ZIBSeq model

	pvalues	qvalues
1	0.144664728056042	0.125896940419006
2	0.198205269193142	0.139233854649929
3	0.00566793498498382	0.0358341428141313
4	0.142450703458304	0.125896940419006
5	0.650347961806409	0.249191944808199
6	0.332088919006491	0.18256970103783
7	0.110812381135047	0.125896940419006
8	0.018429552251035	0.0776775797080376
9	0.491903363519228	0.230365895238663
10	0.149481602689104	0.125896940419006
11	0.108146886935789	0.125896940419006
12	0.0748980599117641	0.125896940419006
13	0.447131785785204	0.217452473513295
14	0.0272155845567047	0.0860319628188012
15	0.606042815146284	0.242436357692026
16	0.107965948361024	0.125896940419006
17	0.0580814636233615	0.125896940419006
18	0.16028183039337	0.125896940419006
19	0.940513377708146	0.312956183240814
20	0.598904276700515	0.242436357692026
21	0.248405777366409	0.149570020619097
22	0.319334489842356	0.18256970103783
23	0.583495023804528	0.242436357692026
24	1.4105582565407e-05	0.000178358242098557
25	0.719627307693503	0.26658254635945
26	0.0729887026803861	0.125896940419006
27	0.785921203458262	0.27604422186671
28	0.242279775107805	0.149570020619097
29	0.737900153014585	0.26658254635945
30	0.133376612225457	0.125896940419006
31	0.169262963895601	0.125896940419006
32	0.388342311067966	0.196415998030281
33	0.920725205848764	0.312956183240814
34	0.3590536291346	0.189169116242343
35	0.240428856352592	0.149570020619097
36	0.530026294775602	0.239354468760951
37	0.085457319818132	0.125896940419006
38	0.613543801741082	0.242436357692026

13.2.3 Remarks on ZIBSeq

When ZIBSeq (Peng et al. 2016) was proposed in 2015, it was evaluated with various simulated data and compared to the zero-inflated Gaussian (ZIG) model or metagenomeSeq (Paulson et al. 2013), which was proposed in 2013. It has been shown (Peng et al. 2016) that:

First, ZIBSeq has better performance in terms of large AUC values in identifying differential features based on ROC analysis via the R package ROCR (Sing et al. 2009) as well as can identify biologically important taxa in a real microbiome data application.
Second, ZIBSeq outperformed ZIG model based on two arguments: (1) compared to ZIG, ZIBSeq was more effective on the simulated multinomial distribution (MN) and binomial distribution (BI) data, although ZIG method performed slightly better than ZIBSeq with larger AUC values at different sample sizes with simulated data of zero-inflated Poisson (ZIP) and zero-inflated negative binomial (ZINB) distributions. (2) ZIBSeq method is more stable, whereas ZIG method is more likely to obtain negative Q-values because it tends to produce more small P-values. Thus, ZIG method may violate the assumption of uniform distribution in the Q-value calculation algorithm proposed by Storey and Tibshirani (2003).
Third, it almost had the same performance whether the response variable (features) is transformed with square root or not transformed in ZIBSeq.
ZIBSeq directly models the proportion after using the Total Sum Scaling (TSS) normalization; in contrast, ZIG models the zero-inflated count data by using the Cumulative Sum Scaling (CSS) normalization (see Sect. 12.1 of Chap. 12); thus, given different approaches of modeling and normalizations used in ZIBSeq and ZIG methods, it is no surprise that ZIBSeq method had better performance for simulated MN, and BI data, while ZIG method had better performance with simulated ZIP and ZINB data.

However, ZIBSeq takes the approach of analyzing relative abundance of individual taxa one by one at a time with a multiple testing correction procedure to control for type I error rate.

This kind of approach was criticized as being not able to incorporate the inter-taxa correlation (Li et al. 2018; Liu and Lin 2018) and cannot provide P-values and correct statistical inferences for the selected taxa (Liu and Lin 2018).
This kind of two-part beta regression model was also criticized that cannot provide a straightforward interpretation of covariate effects on the overall marginal (unconditional) mean (Chai et al. 2018).
Cross-sectional two-part beta regression models including ZIBSeq method can handle zero-inflated proportion data; they cannot deal with repeatedly measured proportion data or longitudinal data (Chen and Li 2016).
ZIBSeq uses GAMLSS package to conduct statistical hypothesis testing the significance, which generates t-statistic and corresponding P-value. Currently the tests are limited for two classes of samples or conditions. For more than two classes, multiple binary recoding is needed before apply zero-inflated beta regression to the recoded outcomes.

13.3 Zero-Inflated Beta-Binomial Model (ZIBB)

ZIBB is another specifically developed model for microbiome data based on zero-inflated beta regression.

13.3.1 Introduction to ZIBB

ZIBB (Hu et al. 2018) was proposed based on zero-inflated beta-binomial model to detect the association of microbiome with a continuous or categorical phenotype of interest. ZIBB model assumes that the count data distribution consists of a mixture of a point mass probability at zero (zero model) and a beta-binomial distribution (count model). Thus, ZIBB model has two components: (1) a zero model accounting for excess zeros and (2) a count model (the remaining component) allowing for over-dispersion effects through beta-binomial regression, which may have an appreciable additional mass at zero. The proposed ZIBB method is an extension of the beta-binomial model of BBSeq (Zhou et al. 2011) that was developed for the analysis of RNA sequence count data. Employing zero-inflated modeling and considering a constrained approach to estimate the over-dispersion parameters have been considered as two major improvements in the proposed ZIBB framework (Hu et al. 2018).

13.3.1.1 Zero and Count Models in ZIBB

ZIBB was proposed for analysis of 16S rRNA sequence count data, a m × n sequence count matrix with columns representing samples and rows representing OTUs (ASVs/Taxa). Let Y = (y_ij) ∈ Ζ^m × n be the count matrix, and each element y_ij represents the count of OTU i in sample j, where (i = 1, …, m, j = 1, …n). Let ${s}_j={\sum}_{i=1}^m{y}_{ij}$ be the library size for sample j. Then, the distribution of microbiome counts data is modeled by a two-stage canonical beta-binomial model, which assuming y_ij follows a binomial distribution Bin(s_j, μ_ij), and μ_ij~Beta(α_1ij, α_2ij). The probability mass function of the beta-binomial distribution with parameter (s_j, α_1ij, α_2ij) is written as f(⋅| α_1ij, α_2ij) or

$f\left({y}_{ij}|{\alpha}_{1 ij},{\alpha}_{2 ij}\right)=\left(\begin{array}{l}{s}_j\\ {}{y}_{ij}\end{array}\right)\frac{B\left({y}_{ij}+{\alpha}_{1 ij},{s}_j-{y}_{ij}+{\alpha}_{2 ij}\right)}{B\left({\alpha}_{1 ij},{\alpha}_{2 ij}\right)}$

(13.5)

In the above formulation, the probability is impliedly dependent on s_j. Since μ_ij~Beta(α_1ij, α_2ij), then the mean and variance of Beta(α_1ij, α_2ij) distribution can be written as:

E(μ_ij) = α_1ij/(α_1ij, α_2ij) and ϕ_iE(μ_ij)(1 − E(μ_ij)). And it is trivial to solve that α_1ij = E(μ_ij)(1 − ϕ_i)/ϕ_i and α_2ij = (1 − E(μ_ij))(1 − ϕ_i)/ϕ_i.

Through reparameterizing the parameters (α_1ij, α_2ij) of the beta distribution to (E(μ_ij), ϕ_i), ϕ_i ≥ 0 in the reparameterized form can be clearly interpreted as the over-dispersion parameter. ZIBB model is a mixture of a point mass at zero (zero model) and a beta-binomial distribution (count model). The density of count y_ij is written as:

${\left.f\left({y}_{ij}|{\alpha}_{1 ij},{\alpha}_{2 ij},{\pi}_{ij}\right)={\pi}_{ij}\right|}_{y_{ij}=0}+\left(1-{\pi}_{ij}\right)f\left({y}_{ij}|{\alpha}_{1 ij},{\alpha}_{2 ij}\right),$

(13.6)

where π_ij is the point mass at zero and f(⋅) is the probability mass function of the canonical beta-binomial distribution as in (13.5). In the above ZIBB model, zero inflation has been taken accounted for by π_ij. Two link functions are used to include the effects of phenotype and covariates in the modeling: one for the zero model and another for the count model. The link function of the zero model is given below:

$\mathrm{logit}\left({\pi}_{ij}\right)=\log \left(\frac{\pi_{ij}}{1-{\pi}_{ij}}\right)={z}_j^T{\eta}_i,$

(13.7)

where z = (z₁, …, z_n)^T ∈ R^n × q is referred to as the design matrix for the zero model. z = (z_{0, j}, …, z_{q − 1, j})^T ∈ R^q is the vector of zero-inflation related covariates for sample j (the z_{0, j} = 1 denotes the intercept) and η_i = (η_{0, i}, …, η_{q − 1, i})^T ∈ R^q is the vector of corresponding coefficients for OTU i. When q = 2, the zero inflation-related covariates are ${z}_j^T=\left(1,\log {s}_j\right)$ . The link function of the count model is given below:

$\mathrm{logit}\left(E\left({\mu}_{ij}\right)\right)=\log \left(\frac{E\left({\mu}_{ij}\right)}{1-E\left({\mu}_{ij}\right)}\right)={X}_j^T{\beta}_i,$

(13.8)

where the X = (X₁, …X_n)^T ∈ R^n × p is referred to as the design matrix for the count model. X_j = (X_{0, j}, …, X_{p − 1, j})^T ∈ R^p is the vector of phenotypes of interest (X_{0, j} = 1 is the intercept of design matrix) for sample j and β_i = (β_{0, i}, …, β_{p − 1, i})^T ∈ R^p is the vector of corresponding coefficients for OTU i.

To model the relationship between the mean and the variance, as well as between the mean and the over-dispersion. ZIBB uses the polynomial fit:

$\mathrm{logit}\left({\phi}_i\right)=\sum \limits_{k=0}^k{\gamma}_k{\left\{\mathrm{mean}\left(X{\beta}_i\right)\right\}}^k$

(13.9)

as the constraint between over-dispersion parameter ϕ_i and coefficients β_i in the count model.

The strategy of considering a constrained approach to model the over-dispersion as a polynomial function of the systematic (mean) component of the generalized linear model was adopted from BBSeq (Zhou et al. 2011). Based on the authors of ZIBB, K = 3 is typically sufficient to effectively model the relationship.

The distinctive feature of ZIBB lies on its modeling the mean-over-dispersion relationship. Thus, the authors of ZIBB called their model and the associated estimation approach as the constrained model and name the model that estimates ϕ_i separately for each OTU i as the free model.

13.3.1.2 Statistical Hypothesis Testing in ZIBB

ZIBB conducts parameter estimation by two steps: First, estimates the parameters in the free model by the maximum likelihood, and then uses the estimates of parameters in the constrained model to fit the polynomial model according to (13.9) and uses least squares solutions to estimate the parameters in γ_k’s. The main purpose of ZIBB is to test associations of OTUs with an experimental phenotypes of interest. Thus, statistical hypothesis testing is to test the statistical significance of β_1i for each OTU i, i = 1, …, m. The null hypothesis is given as: H₀ : β_1i = 0. With P = 2, β_i = (β_0i, β_1i)^T.

The Wald testing statistic is $\frac{{\hat{\beta}}_{1i}}{SE\left({\hat{\beta}}_{1i}\right)}$ . For discrete (e.g., group indicator) phenotype of microbiome count data sets, a t distribution with degrees of freedom n − 2 is used to approximate the distribution of the Wald statistic under the null hypothesis, while for a continuous (e.g., body mass index values) of phenotype, a standard normal is used to approximate the Wald statistic’s null distribution.

13.3.2 Implement ZIBB

ZIBB for the discovery analysis is implemented via the R package ZIBBSeqDiscovery. ZIBBSeqDiscovery uses zero-inflated beta-binomial model to analyze microbiome count data to detect the associate between phenotype of interest and the composition of the counts. The interested covariates can be adjusted via the link functions of zero and count models. The moment corrected correlation (MCC) approach is applied to adjust the P-values.

ZIBBSeqDiscovery takes two approaches to fit the ZIBB model: (1) The free approach treats the over-dispersion parameters for OTUs as independent, and (2) the constrained approach uses a mean-over-dispersion relationship to the count data.

ZIBBSeqDiscovery requires at least three data matrices: a count data matrix, predictor (design) matrix, and zero inflation related matrix. Count data matrix Y (m × n) contains microbiome counts data with m rows and n columns representing m OTUs and n samples. Predictor matrix (design matrix) X (n × p) contains n rows and p column. By default, the first column is the intercept term and the second column denotes the phenotype/covariate we are interested in. Thus, ZIBBSeqDiscovery tests the hypothesis that β_1i = 0. Zero inflation related matrix Z (n × q) contains n rows and q columns. By default, the first column represents for the intercept term. The main function to fit ZIBB model is fitZIBB(). The syntax is given as:

fitZIBB(dataMatrix, X, ziMatrix, mode = "free", gn = 3, betastart = matrix(NA, 0, 0), psi.start = vector(mode = "numeric", length = 0), eta.start = matrix(NA, 0, 0))

where the argument:

dataMatrix is the count matrix (m by n, m is the number of OTUs and n is the number of samples).
X is the design matrix (n by p, p is the number of covariates) for the count model (e.g., beta-binomial), and intercept is included. The second column is assumed to be the covariate of interest.
ziMatrix is the design matrix (n by q, q is the number of covariates) for the zero model, and intercept is included.
mode is used to specify either “free” or “constrained” approach is used to estimate over-dispersion parameters.
gn is used to specify a polynomial with degree of freedom gn to fit the mean-over-dispersion relationship. The default value for gn is 3. It is only used in constrained approach, thus gn is only valid when mode = “constrained”.
betastart is used for specifying the initial values for beta estimation matrix (p by m). Where beta are the effects (or coefficients) for the count model.
psi.start is used to specify the initial values for the logit of over-dispersion parameters vector (with length m).
eta.start is used to specify the initial values for eta estimation matrix (q by m), where eta are the effects (or coefficients) for the zero model.
The arguments psi. betastart, psi.start, and eta.start are required only in constrained approach; they should be assigned as the beta estimation matrix, the psi estimation vector, and the eta estimation matrix, respectively, from free approach.

In the ZIBBSeqDiscovery package, the covariate of interest is referred to as phenotype. Current version (latest version 1.0, March 2018) of this package only considers one phenotype. In the design matrix X, the first column is the intercept, and the phenotype of interest is assumed to corresponding to the second column.

The fitZIBB() function returns several result values, including:

betahat for the estimation matrix of beta (p by m) for count model.
bvar for the estimation matrix of the variance of estimated betahat (p by m).
p for the vector (with length of m) of P-values corresponding to the phenotype, in which the ith P-value corresponds to the hypothesis test of H₀ : β_1i = 0for OTU i.
psi for the estimation vector of the logit of the over-dispersion parameters (with length m).
zeroCoef for the estimation matrix of eta (q by m) for zero model.
gamma for the estimation vector of the coefficients in the mean-over-dispersion relationship in constrained approach (with length gn+1), which is only available when mode = “constrained” is specified.

Example 13.2: Microbiome and Colorectal Carcinoma Data

Dataset is from Kostic et al. (2012) “Genomic analysis identifies association of Fusobacterium with colorectal carcinoma” (Kostic et al. 2012). Sample collection and preparation, amplification, and processing of 16S sequence data were described as in the main article. The data matrix “otu_table_kostic.csv” is a csv file containing 16S rRNA sequencing data which has 2505 OTUs and 185 samples (90 tumor vs. 95 normal colon microbiota). The metadata matrix “meta_table_kostic.csv” is a csv file containing the health status for the samples in otu-table. The interested phenotype here is the health status for samples.

Since the authors of ZIBB (Hu et al. 2018) showed that constrained modeling of ZIBB is better than its free modeling of ZIBB in terms of the type I errors and power, thus we here only report the constrained version of ZIBB using this dataset.

First, install the package ZIBBSeqDiscovery from R CRAN. After the package was installed, we can take seven steps to perform the fitZIBB () function via the ZIBBSeqDiscovery package.

Step 1: Load datasets.

> rm(list = ls())

> setwd("~/Documents/QIIME2R/Ch13_ZeroInflatedBeta")

> otu_tab=read.csv("otu_table_kostic.csv",row.names=1,check.names=FALSE)

> meta_tab=read.csv("meta_table_kostic.csv",row.names=1,check.names=FALSE)

> dim(otu_tab)

[1] 2505 185

> head(otu_tab,3)

C0333.N.518126 C0333.T.518046 X38U4VAHB.518100 XZ33PN7O.518030

304309 40 4 1 2

469478 0 0 0 0

208196 0 0 0 0

We can see that “otu_tab” is a m × n count matrix: m is the number of OTUs (rows) and n is the number of samples(columns). Here there are 2505 taxa and 185 samples.

> dim(meta_tab)

[1] 185 1

> head(meta_tab,3)

Group

1 Healthy

2 Tumor

3 Tumor

Step 2: Filter out OTUs and remove those OTUs that have zero count and few non-zero counts across all samples.

The ZIBB methods may fail to converge for a small proportion of OTUs. This typically occurs when OTUs have few (e.g., four or fewer) non-zero counts and for such OTUs. In such case, the power to detect association with experimental variables is small; thus before modeling using ZIBB methods, those OTUs need to be filtered out. For the remaining small number of OTUs that fail to converge (usually 2% or fewer), the ZIBB methods substitute P-values from the MCC package.

In below analysis, we removed all OTUs with zero counts across all samples. Based on the practice, for the discrete case, OTUs with zero counts across any one of groups were also removed.

> # Remove OTUs that have zero count across all samples

> otu_tab_kostic <- otu_tab[which(rowSums(otu_tab)>0),]

> dim(otu_tab_kostic)

[1] 2490 185

> head(otu_tab_kostic,3)

C0333.N.518126 C0333.T.518046 X38U4VAHB.518100 XZ33PN7O.518030

304309 40 4 1 2

469478 0 0 0 0

208196 0 0 0 0

Step 3: Construct the design matrix and zero inflation related matrix.

We use below commands to construct the design matrix and zero inflation related matrix.

> # Construct the design matrix and zero inflation related matrix

> meta_tab_kostic <- cbind(1, meta_tab=="Tumor")

> zero_tab_kostic <- cbind(1, log(colSums(otu_tab_kostic)))

> head(meta_tab_kostic,3)

Group

1 1 0

2 1 1

3 1 1

> head(zero_tab_kostic,3)

[,1] [,2]

C0333.N.518126 1 8.640

C0333.T.518046 1 7.159

X38U4VAHB.518100 1 8.787

> class(otu_tab_kostic)

[1] "data.frame"

> class(zero_tab_kostic)

[1] "matrix" "array"

> class(meta_tab_kostic)

[1] "matrix" "array"

> otu_tab_kostic<-as.matrix(otu_tab_kostic)

> class(otu_tab_kostic)

[1] "matrix" "array"

> dim(otu_tab_kostic)

[1] 2490 185

> dim(meta_tab_kostic)

[1] 185 2

> dim(zero_tab_kostic)

[1] 185 2

Step 4: Fit the ZIBB model with free approach.

First, we use below commands to fit the ZIBB model with free approach.

> # Fit the ZIBB model with free approach

free_fit <- fitZIBB(otu_tab_kostic, meta_tab_kostic, zero_tab_kostic, mode="free")

Step 5: Fit the ZIBB model with constrained approach.

Then we fit the constrained ZIBB model using the estimation from free approach as initial values.

> # Fit the constrained ZIBB model using the estimation from free approach as initial values

> const_fit <- fitZIBB(otu_tab_kostic, meta_tab_kostic, zero_tab_kostic, mode="constrained",

+ gn=3, betastart=free_fit$betahat,

+ psi.start=free_fit$psi, eta.start=free_fit$zeroCoef)

Depending on the configuration of your computer, the fitZIBB() function may take several minutes to complete the implementation.

Step 6: Use MCC method to adjust the P-values.

Finally, we can employ Moment Corrected Correlation (MCC) method to replace the NAs in the P-values. After running the mcc.adj() function, we will obtain the corrected P-values.

> free_mcc <- mcc.adj(free_fit,otu_tab_kostic, meta_tab_kostic, zero_tab_kostic, K=4)

> const_mcc <- mcc.adj(const_fit,otu_tab_kostic, meta_tab_kostic, zero_tab_kostic, K=4)

NA mostly appears when the OTU counts are zero in most of the samples. In this case, there are 185 samples, thus we want to check the cases such that OTU has 180, 181, 182, 183, and 184 zero counts across the 185 samples, respectively. The following commands can be used to check the effects of MCC adjustment. The NA counts in the reported P-values before and after MCC adjustment will be printed out.

> otu_tab_kostic_0 <- rowSums(otu_tab_kostic==0)

> for (i in 1:5) {

+ idx <- otu_tab_kostic_0 == (185-i)

+ if (i==1) {

+ df_sum <- data.frame(zero.counts = 185-i, N = sum(idx),

+ Num_NA_Infree = sum(is.na(free_fit$p[idx])),

+ Num_NA_Inconst = sum(is.na(const_fit$p[idx])),

+ Num_NA_Infree_MCC = sum(is.na(free_mcc$p[idx])),

+ Num_NA_Inconst_MCC = sum(is.na(const_mcc$p[idx])))

+ } else {

+ df_sum <- rbind(df_sum, c(185-i, sum(idx), sum(is.na(free_fit$p[idx])),

+ sum(is.na(const_fit$p[idx])),

+ sum(is.na(free_mcc$p[idx])),

+ sum(is.na(const_mcc$p[idx]))))

+ }

> print.data.frame(df_sum, right = FALSE)

zero.counts N Num_NA_Infree Num_NA_Inconst Num_NA_Infree_MCC

1 184 731 599 535 236

2 183 334 36 107 0

3 182 214 29 33 0

4 181 146 21 18 0

5 180 113 18 6 0

Num_NA_Inconst_MCC

1 227

2 0

3 0

4 0

5 0

Step 7: Write out the corrected P-values adjusted by the MCC method.

> const_mcc$p

304309 469478 208196 358030 16076 35786 296165

7.073e-08 9.246e-01 9.306e-03 NA 7.102e-01 4.097e-02 9.816e-01

174920 117676 326792 11380 527323 181344 561483

7.271e-01 9.158e-02 7.718e-01 7.295e-02 3.080e-138 1.000e+00 5.420e-02

184450 272955 241674 301062 220782 177005 579750

7.230e-01 9.327e-03 2.110e-01 1.955e-01 5.811e-01 4.682e-01 7.667e-01

469778 149335 249661 110317 345556 348377 205119

2.364e-01 9.912e-01 5.594e-01 3.357e-01 9.739e-127 9.126e-01 5.710e-02

555547 344111 230936 266445 527741 553728 180368

1.124e-01 1.000e+00 9.725e-01 0.000e+00 NA 7.923e-01 2.925e-01

60136 181016 253953 297708 241520 148303 215231

4.473e-01 2.303e-01 4.434e-02 0.000e+00 1.000e+00 NA 1.062e-01

> mccPvalues <- const_mcc$p

> write.csv(mccPvalues,file = paste("Results_ZIBB_Table_MCCadjusted_Pvalues_Kostic.csv",sep = ""))

13.3.3 Remarks on ZIBB

Overall, this study (Hu et al. 2018) showed that both zero inflation and the proposed constraint approaches are vital for accurate and powerful analysis of microbiome count data and ZIBB modeling with the constrained approach is preferred among the competing methods, including BBSeq, edgeR, and ZINB. Especially, it was demonstrated (Hu et al. 2018) that in terms of Type I errors and power for both discrete and continuous cases: (1) ZIBB has higher or comparable performance compared to competing methods. (2) Both the approaches of free and the constrained ZIBB modeling can provide accurate standard error estimates. And (3) the constrained modeling of ZIBB outperformed free modeling of ZIBB. However, the ZIBB methods may fail to converge for a small proportion of OTUs. This converge problem typically occurs when OTUs have four or fewer non-zero counts (Hu et al. 2018).

13.4 Summary

In this chapter we first briefly introduced zero-inflated beta modeling microbiome data and then focused on covering two specifically designed two-part zero-inflated beta regression models for microbiome data: ZIBSeq and ZIBB. We introduced their methodological developments and illustrated their applications with real study data. We also commented on their advantages and limitations. In Chap. 14, we will introduce compositional analysis of microbiome data.

References

Chai, Haitao, Hongmei Jiang, Lu Lin, and Lei Liu. 2018. A marginalized two-part Beta regression model for microbiome compositional data. PLoS Computational Biology 14 (7): e1006329. https://doi.org/10.1371/journal.pcbi.1006329.CrossrefPubMedPubMedCentral
Chen, Eric Z., and Hongzhe Li. 2016. A two-part mixed-effects model for analyzing longitudinal microbiome compositional data. Bioinformatics 32 (17): 2611–2617. https://doi.org/10.1093/bioinformatics/btw308.CrossrefPubMedPubMedCentral
Ferrari, Silvia, and Francisco Cribari-Neto. 2004. Beta regression for modelling rates and proportions. Journal of Applied Statistics 31 (7): 799–815.Crossref
Hu, Tao, Paul Gallins, and Yi-Hui Zhou. 2018. A zero-inflated beta-binomial model for microbiome data analysis. Stat 7 (1): e185. https://doi.org/10.1002/sta4.185. https://onlinelibrary.wiley.com/doi/abs/10.1002/sta4.185.CrossrefPubMed
Jin, Dapeng, Wu Shaoping, Yong-guo Zhang, Lu Rong, Yinglin Xia, Hui Dong, and Jun Sun. 2015. Lack of vitamin D receptor causes dysbiosis and changes the functions of the murine intestinal microbiome. Clinical Therapeutics 37 (5): 996–1009.e7. https://doi.org/10.1016/j.clinthera.2015.04.004. http://www.sciencedirect.com/science/article/pii/S0149291815002283.CrossrefPubMed
Kostic, Aleksandar D., Dirk Gevers, Chandra Sekhar Pedamallu, Monia Michaud, Fujiko Duke, Ashlee M. Earl, Akinyemi I. Ojesina, Joonil Jung, Adam J. Bass, Josep Tabernero, José Baselga, Chen Liu, Ramesh A. Shivdasani, Shuji Ogino, Bruce W. Birren, Curtis Huttenhower, Wendy S. Garrett, and Matthew Meyerson. 2012. Genomic analysis identifies association of Fusobacterium with colorectal carcinoma. Genome Research 22 (2): 292–298. https://doi.org/10.1101/gr.126573.111. https://pubmed.ncbi.nlm.nih.gov/22009990. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3266036/.CrossrefPubMedPubMedCentral
Li, Z., K. Lee, M. R. Karagas, J. C. Madan, A. G. Hoen, A. J. O'Malley, and H. Li. 2018. Conditional Regression Based on a Multivariate Zero-Inflated Logistic-Normal Model for Microbiome Relative Abundance Data. Stat Biosciences 10 (3): 587–608. https://doi.org/10.1007/s12561-018-9219-2
Liu, Zhenqiu, and Shili Lin. 2018. Sparse Treatment-Effect Model for Taxon Identification with High-Dimensional Metagenomic Data. In Microbiome Analysis: Methods and Protocols, edited by Robert G. Beiko, Will Hsiao and John Parkinson, 309–318. New York, NY: Springer New York
Ospina, Raydonal, and Silvia L.P. Ferrari. 2012. A general class of zero-or-one inflated beta regression models. Computational Statistics & Data Analysis 56 (6): 1609–1623.Crossref
Paulson, Joseph N., O. Colin Stine, Héctor Corrada Bravo, and Mihai Pop. 2013. Robust methods for differential abundance analysis in marker gene surveys. Nature Methods 10 (12): 1200–1202. https://doi.org/10.1038/nmeth.2658.
Peng, X., G. Li, and Z. Liu. 2016. Zero-inflated beta regression for differential abundance analysis with metagenomics data. Journal of Computational Biology 23. https://doi.org/10.1089/cmb.2015.0157
Sing, Tobias, Oliver Sander, Niko Beerenwinkel, and Thomas Lengauer. 2009. ROCR: Visualizing the performance of scoring classifiers. R Package Version 1, no. 7
Stasinopoulos, D. Mikis, and Robert A. Rigby. 2007. Generalized additive models for location scale and shape (GAMLSS) in R. Journal of Statistical Software 23 (7): 1–46.Crossref
Storey, John D., and Robert Tibshirani. 2003. Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences 100 (16): 9440–9445.Crossref
Xia, Yinglin, Jun Sun, and Ding-Geng Chen. 2018a. Modeling zero-inflated microbiome data. In Statistical analysis of microbiome data with R, ed. Yinglin Xia, Jun Sun, and Ding-Geng Chen, 453–496. Singapore: Springer Singapore.Crossref
———. 2018b. What are microbiome data? In Statistical analysis of microbiome data with R, ed. Yinglin Xia, Jun Sun, and Ding-Geng Chen, 29–41. Singapore: Springer Singapore.Crossref
Zhou, Yi-Hui, Kai Xia, and Fred A. Wright. 2011. A powerful and flexible approach to the analysis of RNA sequence count data. Bioinformatics 27 (19): 2672–2678.CrossrefPubMedPubMedCentral